Machine Learning methods applied to Wine Quality
Analysis of the main aspects that concur to wine quality, using different machine learning techniques.
## Options of this document
# global chunk options
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
knitr::opts_chunk$set(message = TRUE)
#option for chunk caching.
cache = FALSE
#option for figure caching.
fig.cache = FALSE

## Libraries used in this project
library(tidyverse)
library(mgcv)
library(plotly)
library(GGally)
library(gridExtra)
library(DT)
library(kernlab)
library(neuralnet)
library(e1071)
library(caret)
library(FSinR)
library(nnet)
library(modelr)

1 Abstract
The wine industry is a significant contributor to the global economy, with numerous stakeholders, including producers, consumers, and experts, constantly seeking to improve wine quality. In this context, an anonymous wine producer asked our group to analyse and assess the quality of his products, comparing them against a comprehensive dataset containing diverse wine samples and their corresponding quality ratings. To this end, a thorough analysis of the physicochemical properties and their relationships was essential. This article presents three approaches to developing a satisfactory model for assessing wine quality.
2 Introduction
Grape variety, climate, production techniques, and chemical composition are some of the factors that influence wine quality. To meet customers' expectations and maintain their reputation, wine producers have to assess the quality of their wine accurately. Traditional wine quality assessment methods rely on sensory evaluations by sommeliers, which can be expensive and affected by subjectivity. Therefore, machine learning can be a promising alternative for assessing wine quality based on chemical properties, providing an objective approach.
The final aim of this project is to use machine learning techniques to build a predictive model that can assess the quality of our anonymous customer's products. The approach consists of understanding which physicochemical factors have the greatest impact on wine quality and training three different machine learning algorithms to predict it. The results were evaluated using appropriate metrics. The resulting model can be used to optimise production processes, as well as to assist consumers and retailers in selecting high-quality wines.
2.1 Context
The goal of this study is to use machine learning techniques to estimate the quality of wine based on its physicochemical characteristics. In particular, our anonymous customer was searching for an efficient way to classify some of his products, based on a chemical analysis. The development of a machine learning model to assess wine quality can offer an objective, effective, and economical method to enhance decision making for wine producers, customers and merchants.
2.2 Cleaning the data
The definitive data frame is built from two datasets that represent red and white wine. These datasets are initially read using the read.csv function, with the sep argument set to a semicolon. They are then stored in the red_wine and white_wine variables.
Data entry errors in the fields "volatile acidity" and "density" were fixed: in both columns, some values were expressed in a different measurement unit and had to be converted. This step guarantees that the data are consistently arranged for analysis. The code then adds a new column named "Col" to both the red and white wine datasets. This column stores the colour of the wine, which can be "Red" or "White", and helps to distinguish the two varieties when they are merged in the subsequent steps.
Finally, the wine dataframe is created by joining the cleaned red and white wine datasets: the add_row function is used to append the rows of the second dataset to the first. A preview of the combined wine dataframe is then shown using head(), giving an overview of the format and structure of the data.
red_wine <- read.csv("winequality-red.csv", header = TRUE, sep = ";")
white_wine <- read.csv("winequality-white.csv", header = TRUE, sep = ";")
red_wine[which(red_wine$volatile.acidity > 2), "volatile.acidity"] <- red_wine[which(red_wine$volatile.acidity > 2), "volatile.acidity"]/1000
red_wine[which(red_wine$density > 2), "density"] <- red_wine[which(red_wine$density > 2), "density"]/1000
white_wine[which(white_wine$volatile.acidity > 2), "volatile.acidity"] <- white_wine[which(white_wine$volatile.acidity > 2), "volatile.acidity"]/1000
white_wine[which(white_wine$density > 2), "density"] <- white_wine[which(white_wine$density > 2), "density"]/1000
red <- red_wine
red["Col"] <- "Red"
white <- white_wine
white["Col"] <- "White"
wine <- red
wine <- add_row(wine, white)
head(wine)

## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 76
## 2 7.8 0.88 0.00 2.6 98
## 3 7.8 0.76 0.04 2.3 92
## 4 11.2 0.28 0.56 1.9 75
## 5 7.4 0.70 0.00 1.9 76
## 6 7.4 0.66 0.00 1.8 75
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality Col
## 1 5 Red
## 2 5 Red
## 3 5 Red
## 4 6 Red
## 5 5 Red
## 6 5 Red
3 General overview
In this paragraph, the reader can have a general overview of the dataset. The data were grouped into different categories, such as acidity, sugars, sulfites, general physicochemical aspects, and quality. The reader can also find a correlation plot, to have a better understanding of how the variables correlate with each other.
3.1 Correlation plot
In this section, the reader can see a matrix where each cell represents a pairwise relationship between two variables from the dataset. The only column that is missing is the outcome, quality. The matrix is divided into three parts: the main diagonal shows the density of each variable, the lower triangular part shows all the pairwise scatterplots, and the upper triangular part reports the pairwise Pearson correlation coefficients.
ggpairs(wine,
columns = 1:11,
aes(color = Col,
alpha = 0.5)) +
scale_fill_manual(values = c("#a2231d", "#e4d96f")) +
scale_color_manual(values = c("#a2231d", "#e4d96f"))

3.2 Acidity
In this section, the reader can examine bar plots and boxplots regarding four pivotal measurements. In particular, the studied measurements are fixed acidity, volatile acidity, citric acid concentration, and chloride concentration.
Fixed Acidity: it corresponds to the set of low volatility organic acids such as malic, lactic, tartaric or citric acids. The measurement unit is g/L.
Volatile Acidity: it corresponds to the set of short-chain organic acids that can be extracted from the sample through a distillation process: formic acid, acetic acid, propionic acid and butyric acid. The measurement unit is g/L.
Citric Acid: it is often added to wines to increase acidity in the wine-making process, complement a specific flavour or prevent ferric hazes. The measurement unit is g/L.
Chlorides: the concentration of chloride ions is generally indicative of the presence of sodium chloride. Sodium chloride adds to the saltiness of a wine, which can contribute to or detract from the overall taste and quality of the wine. The concentration is measured in mg/L.
The output can be examined to spot potential variations in the chemical properties of red and white wines. For instance, the distributions of fixed acidity and volatile acidity may differ significantly between red and white wines, while the concentrations of citric acid and chlorides may exhibit smaller variations. Combining bar plots and boxplots makes it easier to understand the variations in the distributions of the different chemical properties, which can guide future research or feature selection in predictive modelling.
fixed_acidity_plot <- ggplot(data = wine, aes(x = fixed.acidity, fill = Col)) +
geom_bar(width = 0.25) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
facet_grid(~Col) +
ggtitle("Fixed Acidity Distribution") +
ylab("Frequency") +
xlab("Fixed Acidity")
volatile_acidity_plot <- ggplot(data = wine, aes(x = volatile.acidity, fill = Col)) +
geom_bar(width = 0.05) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
facet_grid(~Col) +
ggtitle("Volatile Acidity Distribution") +
ylab("Frequency") +
xlab("Volatile Acidity")
citric_acid_plot <- ggplot(data = wine, aes(x = citric.acid, fill = Col)) +
geom_bar(width = 0.05) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
facet_grid(~Col) +
ggtitle("Citric Acid Concentration Distribution") +
ylab("Frequency") +
xlab("Citric Acid Concentration")
chlorides_plot <- ggplot(data = wine, aes(x = chlorides, fill = Col)) +
geom_bar(width = 0.03) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
facet_grid(~Col) +
ggtitle("Chlorides Distribution") +
ylab("Frequency") +
xlab("Chlorides")
grid.arrange(fixed_acidity_plot, volatile_acidity_plot, citric_acid_plot, chlorides_plot, ncol = 2, nrow = 2)
From the graphs, the reader can see the distributions of the acidity measurements. Fixed acidity and volatile acidity, as well as chlorides, are right-skewed for both types of wine. This skewness could be crucial to gaining insights into the impact these factors have on wine quality.
fixed_acidity_boxplot <- ggplot(data = wine, aes(x = Col, y = fixed.acidity, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
ggtitle("Fixed Acidity Boxplot") +
ylab("Fixed Acidity") +
xlab("")
volatile_acidity_boxplot <- ggplot(data = wine, aes(x = Col, y = volatile.acidity, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
ggtitle("Volatile Acidity Boxplot") +
ylab("Volatile Acidity") +
xlab("")
citric_acid_boxplot <- ggplot(data = wine, aes(x = Col, y = citric.acid, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
ggtitle("Citric Acid Concentration Boxplot") +
ylab("Citric Acid Concentration") +
xlab("")
chlorides_boxplot <- ggplot(data = wine, aes(x = Col, y = chlorides, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 13),
legend.position = "none"
) +
ggtitle("Chlorides Concentration Boxplot") +
ylab("Chlorides Concentration") +
xlab("")
grid.arrange(fixed_acidity_boxplot, volatile_acidity_boxplot, citric_acid_boxplot, chlorides_boxplot,
ncol = 2, nrow = 2)

From the boxplots above, the reader can distinguish between the chemical characteristics of red and white wine. In addition, the boxplot analysis makes it possible to spot variations in the overall distribution of the data. Red wines have a greater median fixed acidity than white wines, and the IQR for red wines is likewise larger. The volatile acidity boxplot shows that red wines have a greater median volatile acidity than white wines, with very similar IQRs for both types. The citric acid concentration boxplot shows that white wines have a higher median citric acid content than red wines, with a narrower IQR for white wines. Moreover, the chloride concentration boxplot reveals that red wines have a wider interquartile range (IQR) than white wines and a greater median chloride concentration.
3.3 Residual Sugar
Residual sugar concentration plays a crucial role in wine sweetness. Moreover, it contributes to the overall taste and mouthfeel of wine. Different residual sugar concentrations correspond to different wine styles.
Wine sweetness levels can be classified into different categories based on their residual sugar content:
- Bone dry: 0 - 1 g/L
- Dry: 1 - 17 g/L
- Off-dry: 17 - 35 g/L
- Medium Sweet: 35 - 120 g/L
- Sweet: 120 - 220+ g/L
Residual sugar is usually measured in grams per litre (g/L) or as a percentage of the wine’s total volume. Wines with higher residual sugar levels are considered sweeter, while those with lower levels are considered drier. The perception of sweetness, however, can be influenced by other factors such as acidity, tannins, and alcohol content.
These categories provide a general guideline for understanding the sweetness of a wine, but the individual perception of sweetness may vary. In addition, it is necessary to specify that it is not the purpose of this project to work with different classes of residual sugar concentration.
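Although the sweetness classes are not used in the models below, the thresholds above translate directly into code. A minimal sketch with base R's cut() (the helper sweetness is hypothetical, not part of this project's code):

```r
# Sketch: map residual sugar (g/L) to the sweetness styles listed above.
sweetness <- function(residual_sugar) {
  cut(residual_sugar,
      breaks = c(0, 1, 17, 35, 120, Inf),
      labels = c("Bone dry", "Dry", "Off-dry", "Medium Sweet", "Sweet"),
      include.lowest = TRUE)
}

# A typical dry red (~2 g/L) versus a dessert wine (~150 g/L)
sweetness(c(2, 150))
```

Note that cut() uses right-closed intervals by default, so a wine at exactly 1 g/L falls into "Bone dry", matching the boundaries of the list above.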
residual_sugar_plot <- ggplot(data = wine, aes(x = residual.sugar, fill = Col)) +
geom_bar(width = 2) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Residual Sugar Distribution") +
ylab("Frequency") +
xlab("Residual Sugar")
residual_sugar_plot

residual_sugar_boxplot <- ggplot(data = wine, aes(x = Col, y = residual.sugar, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Residual Sugar Boxplot") +
ylab("Residual Sugar") +
xlab("")
residual_sugar_boxplot

Above, the reader can see the distribution of residual sugar in the dataset. In particular, the distribution seems to be right-skewed for both red and white wines. In addition, looking at the boxplots, a relevant aspect is that, on average, residual sugar concentration is higher in white wines.
3.4 Sulfates
In wine-making, sulphates (sulphur-containing compounds) are often used to preserve the organoleptic properties of the wine. They prevent oxidation and bacterial growth, ensuring that the wine maintains its quality and freshness. Sulphates are naturally produced throughout the fermentation process; however, winemakers sometimes add them in the form of potassium metabisulfite or other sulphate salts.
Although sulphates play an important role in preserving wine, their levels are regulated by law to ensure consumer safety.
In this project, the focus is on two different types of sulphates:
Free Sulphur Dioxide: The free SO2 is the unreacted component of sulphur dioxide present in the wine. Chemically speaking, it is made up of mostly the molecular (SO2) and bisulfite (HSO3 -) forms. The measurement unit is mg/L.
Total Sulphur Dioxide: Total SO2 is the total amount of SO2 added, or the sum of the free and bound fractions. The measurement unit is mg/L.
Sulphates: In this case, "Sulphates" refers to potassium sulphates. In particular, the presence of potassium metabisulfite is fundamental in wine-making: it is an antioxidant and bactericide that releases sulphur dioxide into the wine must. The measurement unit is g/L.
free_dioxide_sulfur_plot <- ggplot(data = wine, aes(x = free.sulfur.dioxide, fill = Col)) +
geom_bar(width = 5) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Free Sulfur Dioxide Concentration Distribution") +
ylab("Frequency") +
xlab("Free Sulfur Dioxide Concentration")
total_dioxide_sulfur_plot <- ggplot(data = wine, aes(x = total.sulfur.dioxide, fill = Col)) +
geom_bar(width = 10) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Total Sulfur Dioxide Concentration Distribution") +
ylab("Frequency") +
xlab("Total Sulfur Dioxide Concentration")
sulfites_plot <- ggplot(data = wine, aes(x = sulphates, fill = Col)) +
geom_bar(width = .05) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Sulfites Distribution") +
ylab("Frequency") +
xlab("Sulfites")
grid.arrange(free_dioxide_sulfur_plot, total_dioxide_sulfur_plot, sulfites_plot, ncol = 1, nrow = 3)

free_sulfur_dioxide_boxplot <- ggplot(data = wine, aes(x = Col, y = free.sulfur.dioxide, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none",
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))
) +
ggtitle("Free Sulfur Dioxide Concentration Boxplot") +
ylab("Free Sulfur
values") +
xlab("")
total_sulfur_dioxide_boxplot <- ggplot(data = wine, aes(x = Col, y = total.sulfur.dioxide, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none",
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))
) +
ggtitle("Total Sulfur Dioxide Concentration Boxplot") +
ylab("Total Sulfur
values") +
xlab("")
sulphates_boxplot <- ggplot(data = wine, aes(x = Col, y = sulphates, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none",
axis.title.y = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0))
) +
ggtitle("Sulfites Concentration Boxplot") +
ylab("Sulfites Concentration
values") +
xlab("")
grid.arrange(free_sulfur_dioxide_boxplot, total_sulfur_dioxide_boxplot, sulphates_boxplot,
ncol = 1, nrow = 3)

The concentration of sulphates, as seen before, is right-skewed for each category. In addition, the boxplots show that the vast majority of the values are concentrated close to the median.
3.5 Other aspects
Sugar, alcohol, and other dissolved solids are responsible for the density of a wine. In wine-making, density is essential for gathering information about the fermentation progress.
The pH scale is a universal measurement of the acidity or basicity of a solution. A wine's pH depends on several factors, such as grape variety, soil, and environment, as well as the methods employed. In addition, it is common practice to alter the acidity levels, and thus the pH, to develop a particular taste or style. Understanding and controlling pH levels can aid winemakers in producing wines with the right flavour character and level of acidity. While acidity and pH are related, they are not interchangeable: a wine can have high total acidity but a relatively high pH if the acids present are weak, while a wine with lower total acidity can have a low pH if the acids present are strong.
The alcohol degree is a measure of the amount of alcohol present in a given volume of wine, expressed as a percentage. It is essential in determining the wine's body, texture, and balance. Generally, the alcohol degree ranges from 5% to 20%.
density_plot <- ggplot(data = wine, aes(x = density, fill = Col)) +
geom_bar(width = .0015) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Density Distribution") +
ylab("Frequency") +
xlab("Density")
pH_plot <- ggplot(data = wine, aes(x = pH, fill = Col)) +
geom_bar(width = .1) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("pH Distribution") +
ylab("Frequency") +
xlab("pH")
alcohol_plot <- ggplot(data = wine, aes(x = alcohol, fill = Col)) +
geom_bar(width = .5) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Alcohol Degree Distribution") +
ylab("Frequency") +
xlab("Alcohol Degree")
grid.arrange(density_plot, pH_plot, alcohol_plot, ncol = 1, nrow = 3)

density_boxplot <- ggplot(data = wine, aes(x = Col, y = density, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Density Boxplot") +
ylab("Density") +
xlab("")
pH_boxplot <- ggplot(data = wine, aes(x = Col, y = pH, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("pH Boxplot") +
ylab("pH") +
xlab("")
alcohol_boxplot <- ggplot(data = wine, aes(x = Col, y = alcohol, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Alcohol Degree Boxplot") +
ylab("Alcohol Degree") +
xlab("")
grid.arrange(density_boxplot, pH_boxplot, alcohol_boxplot, ncol = 1, nrow = 3)

The reader can see how the values tend to concentrate close to the median in each category. Another interesting aspect is the pH distribution, which is quite symmetrical for both types of wine.
3.6 Quality
Wine quality is a mixture of several elements:
- Acidity: The right balance of acidity is crucial for a wine’s freshness, structure, and ageability.
- Tannins: Tannins, mainly found in red wines, contribute to the wine’s structure, texture, and ageing potential.
- Sugar: The residual sugar in wine can affect its sweetness, body, and balance.
- Alcohol: The alcohol content influences the wine’s body, texture, and overall balance.
- Aroma and flavour compounds: The wine’s aroma and flavour profile are essential for its complexity, appeal, and distinctiveness. These compounds can come from the grapes, fermentation process, or ageing vessels.
In summary, wine quality is a result of multiple factors in the production process, from grape cultivation to winemaking techniques. Attention to detail at each stage and a focus on maintaining balance and harmony among the various elements in the wine contribute to a high-quality end product.
The provided code generates an interactive bar plot that shows the distribution of wine quality for red and white wines. A scale from 1 to 10 was used to rate the wine quality for both red and white wines.
quality_plot <- ggplot(data = wine, aes(x = quality, fill = Col)) +
geom_bar(width = .5) +
facet_grid(~Col) +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Quality Distribution") +
ylab("Frequency") +
xlab("Quality")
ggplotly(quality_plot)

quality_boxplot <- ggplot(data = wine, aes(x = Col, y = quality, fill = Col)) +
geom_boxplot() +
scale_fill_manual(name = "Type of wine: ",
values = c("#a2231d", "#e4d96f")) +
theme(
plot.title = element_text(color = "black", size = 20),
legend.position = "none"
) +
ggtitle("Quality Boxplot") +
ylab("Quality levels") +
xlab("")
ggplotly(quality_boxplot)

The Quality plot provides a visual representation of the distribution of red and white wines based on their quality levels. We can observe that the majority of wines fall in the range of 5-7 for both red and white wines. The plot can help us gain insights into the quality of red and white wines and identify potential areas for improvement in wine production techniques.
Since the categories are too unbalanced, in terms of sample size, it was decided to group the quality value into three subcategories:
- Low-quality wine: quality <= 5
- Mid-quality wine: 5 < quality <= 6
- High-quality wine: 7 <= quality <= 10
This simplification of the model can be problematic. However, it is a pragmatic way to deal with such an unbalanced dataset.
red_wine[, "quality"] <- as.numeric(red_wine[, "quality"])
for (i in 1:nrow(red_wine)){
if (red_wine[i, "quality"] <= 5){
red_wine[i, "quality"] = 1
}
else if (red_wine[i, "quality"] == 6){
red_wine[i, "quality"] = 2
}
else if(red_wine[i, "quality"] > 6){
red_wine[i,"quality"] = 3
}
}
red_wine[, "quality"] <- as.factor(red_wine[, "quality"])
str(red_wine[, "quality"])

## Factor w/ 3 levels "1","2","3": 1 1 1 2 1 1 1 3 3 1 ...
table(red_wine[, "quality"])

##
## 1 2 3
## 744 638 217
white_wine[, "quality"] <- as.numeric(white_wine[, "quality"])
for (i in 1:nrow(white_wine)){
if (white_wine[i, "quality"] <= 5){
white_wine[i, "quality"] = 1
}
else if (white_wine[i, "quality"] == 6){
white_wine[i, "quality"] = 2
}
else if(white_wine[i, "quality"] > 6){
white_wine[i,"quality"] = 3
}
}
white_wine[, "quality"] <- as.factor(white_wine[, "quality"])
str(white_wine[, "quality"])

## Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
table(white_wine[, "quality"])

##
## 1 2 3
## 1640 2198 1060
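The element-wise loops above are explicit but verbose; the same three-level recoding can be written vectorised with cut(). This is a sketch with a hypothetical helper, regroup_quality, not part of the original code:

```r
# Vectorised equivalent of the recoding loops:
# quality <= 5 -> "1", quality == 6 -> "2", quality > 6 -> "3".
regroup_quality <- function(quality) {
  cut(as.numeric(quality), breaks = c(-Inf, 5, 6, Inf), labels = c("1", "2", "3"))
}

regroup_quality(c(3, 5, 6, 7, 9))
```

Besides being shorter, cut() returns a factor directly, so the separate as.factor() conversion is no longer needed.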
4 Modelling
In the modelling section, the aim is to answer our anonymous customer’s questions. The techniques that were employed to fulfil the task are various and depend on the requirements.
These are the questions:
“In the company, we would like to have control of the value of density during the process. Unfortunately, our machines can calculate only chemical concentrations during wine-making. Can you find an easy-to-understand model to estimate the density starting from physicochemical factors?”
“Can you create a model that can predict the quality of a wine, given its physicochemical properties? If it is possible, what is the precision of the model, can we rely on it?”
In the following subsections, the reader can find our strategy to succeed in satisfying our customer needs.
4.1 Linear Regression Model
Red Wine
In this section, the aim is to answer the first question of our anonymous customer trying to keep it simple. In particular, linear regression can be an easy and very understandable way to address the problem. Therefore, two linear regression models were created.
The first one has density as the response, as our customer requested, with two predictors: pH and citric acid concentration. In addition, two cases were considered: in the first, the linear model did not include the interaction between pH and citric acid concentration, while the second did. The models were compared using ANOVA. Moreover, the reader can consult the graph below to get a clearer idea of the situation.
### Red Wine
linear_model_3d_one_rw <- lm(data = red_wine, density ~ pH + citric.acid)
summary(linear_model_3d_one_rw)

##
## Call:
## lm(formula = density ~ pH + citric.acid, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0079699 -0.0009415 -0.0000033 0.0010033 0.0063725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0043267 0.0011444 877.572 < 2e-16 ***
## pH -0.0024911 0.0003332 -7.476 1.26e-13 ***
## citric.acid 0.0024659 0.0002641 9.338 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001728 on 1596 degrees of freedom
## Multiple R-squared: 0.1625, Adjusted R-squared: 0.1615
## F-statistic: 154.9 on 2 and 1596 DF, p-value: < 2.2e-16
linear_model_3d_one_rw_al <- lm(data = red_wine, density ~ pH * citric.acid)
summary(linear_model_3d_one_rw_al)

##
## Call:
## lm(formula = density ~ pH * citric.acid, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0080516 -0.0009490 0.0000061 0.0010482 0.0063830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0020314 0.0016139 620.879 < 2e-16 ***
## pH -0.0018046 0.0004763 -3.789 0.000157 ***
## citric.acid 0.0113743 0.0044287 2.568 0.010309 *
## pH:citric.acid -0.0027147 0.0013472 -2.015 0.044062 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001727 on 1595 degrees of freedom
## Multiple R-squared: 0.1646, Adjusted R-squared: 0.1631
## F-statistic: 104.8 on 3 and 1595 DF, p-value: < 2.2e-16
anova(linear_model_3d_one_rw, linear_model_3d_one_rw_al)

## Analysis of Variance Table
##
## Model 1: density ~ pH + citric.acid
## Model 2: density ~ pH * citric.acid
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1596 0.0047671
## 2 1595 0.0047550 1 1.2105e-05 4.0606 0.04406 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this R output, the reader can check the summary of the linear regression for both cases, as well as the result of the model’s comparison. In this case, there is weak evidence to reject the null hypothesis. Therefore, to keep it simple, the first model is the chosen one (less complex, similar results).
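A complementary check for this choice is to compare information criteria, which penalise the extra interaction term. The sketch below is self-contained on simulated data (the variable names mirror the wine data, but the numbers are invented), since it only illustrates the AIC() call:

```r
# Sketch: compare an additive and an interaction model by AIC on simulated data.
set.seed(42)
n <- 500
pH <- rnorm(n, mean = 3.3, sd = 0.15)
citric.acid <- runif(n, min = 0, max = 0.8)
# Simulated response without a true interaction term.
density <- 1.004 - 0.0025 * pH + 0.0025 * citric.acid + rnorm(n, sd = 0.0017)
sim <- data.frame(density, pH, citric.acid)

additive <- lm(density ~ pH + citric.acid, data = sim)
with_int <- lm(density ~ pH * citric.acid, data = sim)
AIC(additive, with_int)  # the lower value indicates the preferred trade-off
```

When the interaction barely improves the fit, as here, the AIC penalty for the extra parameter tends to favour the simpler model, in line with the choice made above.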
## Weak evidence for the interaction term: the simpler additive model is retained.
linear_plot_3d_one_rw <- plot_ly(data = red_wine, z= ~density, y= ~pH, x= ~citric.acid,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
linear_plot_3d_one_rw <- linear_plot_3d_one_rw %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Citric Acid Concentration'),
yaxis = list(title = 'pH'),
zaxis = list(title = 'Density')))
linear_plot_3d_one_rw

3D Scatterplot Model 1
The second model created to satisfy the needs of the client is a linear regression model with density as the response and pH and fixed acidity as predictors. Also in this case, two models were created to compare the version with the interaction to the additive one.
linear_model_3d_two_rw <- lm(data = red_wine, density ~ pH + fixed.acidity)
summary(linear_model_3d_two_rw)

##
## Call:
## lm(formula = density ~ pH + fixed.acidity, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0052890 -0.0007712 0.0001500 0.0009219 0.0056269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.807e-01 1.175e-03 834.950 <2e-16 ***
## pH 2.625e-03 3.047e-04 8.614 <2e-16 ***
## fixed.acidity 8.831e-04 2.702e-05 32.683 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001374 on 1596 degrees of freedom
## Multiple R-squared: 0.4709, Adjusted R-squared: 0.4702
## F-statistic: 710.2 on 2 and 1596 DF, p-value: < 2.2e-16
linear_model_3d_two_rw_al <- lm(data = red_wine, density ~ pH * fixed.acidity)
summary(linear_model_3d_two_rw_al)

##
## Call:
## lm(formula = density ~ pH * fixed.acidity, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0052109 -0.0007631 0.0001356 0.0008772 0.0055997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9995418 0.0028257 353.729 < 2e-16 ***
## pH -0.0032475 0.0008581 -3.785 0.00016 ***
## fixed.acidity -0.0016475 0.0003475 -4.741 2.31e-06 ***
## pH:fixed.acidity 0.0007917 0.0001084 7.304 4.39e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001352 on 1595 degrees of freedom
## Multiple R-squared: 0.488, Adjusted R-squared: 0.487
## F-statistic: 506.8 on 3 and 1595 DF, p-value: < 2.2e-16
anova(linear_model_3d_two_rw, linear_model_3d_two_rw_al)
## Analysis of Variance Table
##
## Model 1: density ~ pH + fixed.acidity
## Model 2: density ~ pH * fixed.acidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1596 0.0030118
## 2 1595 0.0029143 1 9.7476e-05 53.349 4.392e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Above, the reader can inspect the results. Both models fit reasonably well; however, the ANOVA comparison shows that the second model, the one with the interaction term, is more appropriate (there is strong evidence to reject the null hypothesis that the interaction coefficient is zero).
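For readers who want to see where the F statistic comes from, it can be reproduced by hand from the residual sums of squares printed in the ANOVA table above; a minimal sketch (the RSS values are rounded in the table, so the match is only approximate):

```r
# Reproduce the F statistic printed by anova() from the two residual
# sums of squares (values copied from the table above).
rss_reduced <- 0.0030118  # density ~ pH + fixed.acidity
rss_full    <- 0.0029143  # density ~ pH * fixed.acidity
df_extra    <- 1          # one extra coefficient: the interaction
df_resid    <- 1595       # residual df of the larger model

f_stat  <- ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)
round(f_stat, 1)  # roughly 53.4, matching the anova() output above
```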
linear_plot_3d_two_rw <- plot_ly(data = red_wine, z= ~density, y= ~pH, x= ~fixed.acidity,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
linear_plot_3d_two_rw <- linear_plot_3d_two_rw %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Fixed Acidity'),
yaxis = list(title = 'pH'),
zaxis = list(title = 'Density')))
linear_plot_3d_two_rw
3D Scatterplot Model 2
Finally, we compared the resulting models.
anova(linear_model_3d_one_rw, linear_model_3d_two_rw_al)
## Analysis of Variance Table
##
## Model 1: density ~ pH + citric.acid
## Model 2: density ~ pH * fixed.acidity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1596 0.0047671
## 2 1595 0.0029143 1 0.0018528 1014 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As was already clear from the previous summaries, the second model performs better. (Strictly speaking, these two models are not nested, so this F-test should be read with caution; the RSS and adjusted R² values point in the same direction.)
To summarise, two different models were compared to find an appropriate way of estimating red wine density from other physicochemical parameters. The result is easy to interpret and works satisfactorily within the observed range of the three variables. The R² and adjusted R² are around 50%, and the p-values of all coefficients are extremely low, giving strong evidence to reject the null hypothesis.
However, the reader is invited to use the Shiny dashboard to experiment with new combinations in the 3D space. The result presented in this section comes from the authors' own exploration of the dataset; readers who can think of a different way to estimate the density are free to try it.
White Wine
In the section above, the reader went through our answer to the first question for red wines. Now the purpose is to repeat the same process with white wine density as the response. As before, two linear regression models were considered.
The first model has two predictors: residual sugar concentration and total sulphur dioxide. Two cases were considered, a linear regression without the predictor interaction and one with it, and ANOVA was again used for the comparison. The reader can scroll through the document to see all the summaries.
### White Wine
linear_model_3d_one_ww <- lm(data = white_wine, density ~ residual.sugar + total.sulfur.dioxide)
summary(linear_model_3d_one_ww)
##
## Call:
## lm(formula = density ~ residual.sugar + total.sulfur.dioxide,
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0055800 -0.0009621 0.0000525 0.0009878 0.0184483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.890e-01 7.305e-05 13538.14 <2e-16 ***
## residual.sugar 4.402e-04 4.617e-06 95.36 <2e-16 ***
## total.sulfur.dioxide 1.620e-05 5.510e-07 29.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001501 on 4895 degrees of freedom
## Multiple R-squared: 0.7483, Adjusted R-squared: 0.7482
## F-statistic: 7277 on 2 and 4895 DF, p-value: < 2.2e-16
linear_model_3d_one_ww_al <- lm(data = white_wine, density ~ residual.sugar * total.sulfur.dioxide)
summary(linear_model_3d_one_ww_al)
##
## Call:
## lm(formula = density ~ residual.sugar * total.sulfur.dioxide,
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0054504 -0.0009833 0.0000455 0.0010110 0.0181895
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.893e-01 1.178e-04 8396.215 < 2e-16 ***
## residual.sugar 3.872e-04 1.649e-05 23.482 < 2e-16 ***
## total.sulfur.dioxide 1.387e-05 8.870e-07 15.635 < 2e-16 ***
## residual.sugar:total.sulfur.dioxide 3.622e-07 1.081e-07 3.352 0.000809 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001499 on 4894 degrees of freedom
## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7487
## F-statistic: 4865 on 3 and 4894 DF, p-value: < 2.2e-16
anova(linear_model_3d_one_ww, linear_model_3d_one_ww_al)
## Analysis of Variance Table
##
## Model 1: density ~ residual.sugar + total.sulfur.dioxide
## Model 2: density ~ residual.sugar * total.sulfur.dioxide
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4895 0.011026
## 2 4894 0.011000 1 2.5253e-05 11.235 0.0008089 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, there is enough evidence to reject the null hypothesis; consequently, the model with the interaction is preferred.
linear_plot_3d_one_ww <- plot_ly(data = white_wine, z= ~density, y= ~residual.sugar, x= ~total.sulfur.dioxide,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
linear_plot_3d_one_ww <- linear_plot_3d_one_ww %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Total Sulfur Dioxide'),
yaxis = list(title = 'Residual Sugar Concentration'),
zaxis = list(title = 'Density')))
linear_plot_3d_one_ww
3D Scatterplot Model 1
The second model has density as the response and residual sugar concentration and alcohol degree as predictors. Again, two versions were fitted, one with the interaction term and one without.
linear_model_3d_two_ww <- lm(data = white_wine, density ~ alcohol + residual.sugar)
summary(linear_model_3d_two_ww)
##
## Call:
## lm(formula = density ~ alcohol + residual.sugar, data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0020196 -0.0005852 -0.0001403 0.0004674 0.0249805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.005e+00 1.349e-04 7446.7 <2e-16 ***
## alcohol -1.226e-03 1.189e-05 -103.2 <2e-16 ***
## residual.sugar 3.607e-04 2.884e-06 125.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009137 on 4895 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9067
## F-statistic: 2.379e+04 on 2 and 4895 DF, p-value: < 2.2e-16
linear_model_3d_two_ww_al <- lm(data = white_wine, density ~ alcohol * residual.sugar)
summary(linear_model_3d_two_ww_al)
##
## Call:
## lm(formula = density ~ alcohol * residual.sugar, data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0021315 -0.0005798 -0.0001328 0.0004588 0.0223174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.006e+00 1.940e-04 5185.069 < 2e-16 ***
## alcohol -1.370e-03 1.789e-05 -76.593 < 2e-16 ***
## residual.sugar 1.042e-04 2.416e-05 4.313 1.64e-05 ***
## alcohol:residual.sugar 2.561e-05 2.396e-06 10.690 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009033 on 4894 degrees of freedom
## Multiple R-squared: 0.9088, Adjusted R-squared: 0.9088
## F-statistic: 1.627e+04 on 3 and 4894 DF, p-value: < 2.2e-16
anova(linear_model_3d_two_ww, linear_model_3d_two_ww_al)
## Analysis of Variance Table
##
## Model 1: density ~ alcohol + residual.sugar
## Model 2: density ~ alcohol * residual.sugar
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4895 0.0040862
## 2 4894 0.0039929 1 9.3239e-05 114.28 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results in this case are impressive: the data fit a plane almost perfectly. The R² and adjusted R² are around 90%, meaning that most of the variance is explained by the models. Comparing the results with ANOVA led us to select the second model, the one with the interaction.
linear_plot_3d_two_ww <- plot_ly(data = white_wine, z= ~density, y= ~residual.sugar, x= ~alcohol,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
linear_plot_3d_two_ww <- linear_plot_3d_two_ww %>% add_markers() %>%
layout(scene = list(xaxis = list(title = 'Alcohol'),
yaxis = list(title = 'Residual Sugar Concentration'),
zaxis = list(title = 'Density')))
linear_plot_3d_two_ww
3D Scatterplot Model 2
In this specific case, both models were adequate; however, the second model was selected, since its R² and adjusted R² values are closer to 1.
Once again, readers are invited to try their own ideas on the Shiny dashboard.
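Besides R² and the ANOVA F-test, a simple sanity check is to compare the in-sample RMSE of the two fits. The sketch below is a minimal illustration on synthetic data (the variable names mirror the wine dataset, but the values are simulated); for nested models, the richer fit can never have a worse in-sample RMSE.

```r
set.seed(42)
d <- data.frame(alcohol = runif(200, 8, 14),
                residual.sugar = runif(200, 1, 20))
d$density <- 1.006 - 0.0014 * d$alcohol + 1e-4 * d$residual.sugar +
  2.5e-5 * d$alcohol * d$residual.sugar + rnorm(200, sd = 1e-3)

m_add <- lm(density ~ alcohol + residual.sugar, data = d)  # no interaction
m_int <- lm(density ~ alcohol * residual.sugar, data = d)  # with interaction

rmse <- function(model) sqrt(mean(residuals(model)^2))
rmse(m_int) <= rmse(m_add)  # TRUE: the richer nested model fits at least as well
```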
4.2 Non-Linear Models
In this chapter on non-linearities, the final aim is still to answer the first question of our anonymous customer. The methodology is unchanged: finding a simple model that estimates the density parameter from the physicochemical properties of the wine, this time allowing non-linear terms.
The section will follow the following structure for both wine types considered:
Polynomial relationship in the Linear Model
Regression and Smooth Splines
Generalised Additive Model
Red wine
In this paragraph, the reader will find the exact steps followed to reach a satisfactory solution. The response is still density; however, the predictors can now enter as quadratic terms via the I( ) or poly( ) functions. This makes it possible to check whether a slightly more complex model (quadratic polynomials) produces a more precise estimate of the density parameter.
In the first part, five different models are evaluated; there is a summary for every combination of I( ), poly( ) and linear terms. Finally, the two models with the highest adjusted R² were extracted and compared using ANOVA.
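The two ways of introducing a quadratic term are not identical: I( ) adds only the raw square, while poly(x, 2) adds an orthogonal basis covering both the linear and quadratic components. A small self-contained sketch on synthetic data (the variable names are illustrative, not from the wine dataset) shows that a raw polynomial with both terms and an orthogonal polynomial of the same degree give identical fitted values:

```r
set.seed(1)
x <- runif(100, 2.5, 4)  # e.g. a pH-like predictor
y <- 1 + 0.5 * x - 0.1 * x^2 + rnorm(100, sd = 0.05)
d <- data.frame(x = x, y = y)

fit_raw  <- lm(y ~ x + I(x^2), data = d)   # raw polynomial basis
fit_poly <- lm(y ~ poly(x, 2), data = d)   # orthogonal polynomial basis

# The bases differ (and so do the coefficients), but the two models
# span the same space, so fitted values and R^2 are identical.
all.equal(fitted(fit_raw), fitted(fit_poly))  # TRUE
```

This is why `density ~ pH + I(citric.acid^2)` (square only, no linear term) and `density ~ pH + poly(citric.acid, 2)` can give different summaries.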
quadratic_model_one_rw_1 <- lm(density ~ pH + poly(citric.acid, 2), data = red_wine)
summary(quadratic_model_one_rw_1)
##
## Call:
## lm(formula = density ~ pH + poly(citric.acid, 2), data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0078678 -0.0009631 -0.0000115 0.0010391 0.0064851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0051148 0.0011017 912.294 < 2e-16 ***
## pH -0.0025273 0.0003325 -7.601 4.98e-14 ***
## poly(citric.acid, 2)1 0.0190816 0.0020511 9.303 < 2e-16 ***
## poly(citric.acid, 2)2 0.0053977 0.0017246 3.130 0.00178 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001724 on 1595 degrees of freedom
## Multiple R-squared: 0.1676, Adjusted R-squared: 0.1661
## F-statistic: 107.1 on 3 and 1595 DF, p-value: < 2.2e-16
quadratic_model_one_rw_2 <- lm(density ~ pH + I(citric.acid**2), data = red_wine)
summary(quadratic_model_one_rw_2)
##
## Call:
## lm(formula = density ~ pH + I(citric.acid^2), data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0078610 -0.0009702 0.0000009 0.0010335 0.0065049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0048225 0.0010942 918.282 < 2e-16 ***
## pH -0.0025722 0.0003232 -7.958 3.28e-15 ***
## I(citric.acid^2) 0.0039596 0.0004017 9.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001723 on 1596 degrees of freedom
## Multiple R-squared: 0.1675, Adjusted R-squared: 0.1664
## F-statistic: 160.5 on 2 and 1596 DF, p-value: < 2.2e-16
quadratic_model_one_rw_3 <- lm(density ~ poly(pH, 2) + poly(citric.acid, 2), data = red_wine)
summary(quadratic_model_one_rw_3)
##
## Call:
## lm(formula = density ~ poly(pH, 2) + poly(citric.acid, 2), data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0073704 -0.0009730 -0.0000152 0.0010386 0.0065005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.967e-01 4.303e-05 23163.104 < 2e-16 ***
## poly(pH, 2)1 -1.541e-02 2.050e-03 -7.515 9.42e-14 ***
## poly(pH, 2)2 -4.437e-03 1.786e-03 -2.485 0.013070 *
## poly(citric.acid, 2)1 1.949e-02 2.054e-03 9.488 < 2e-16 ***
## poly(citric.acid, 2)2 6.527e-03 1.781e-03 3.665 0.000255 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001721 on 1594 degrees of freedom
## Multiple R-squared: 0.1708, Adjusted R-squared: 0.1688
## F-statistic: 82.11 on 4 and 1594 DF, p-value: < 2.2e-16
quadratic_model_one_rw_4 <- lm(density ~ poly(pH, 2) + citric.acid, data = red_wine)
summary(quadratic_model_one_rw_4)
##
## Call:
## lm(formula = density ~ poly(pH, 2) + citric.acid, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0076730 -0.0009486 -0.0000248 0.0010329 0.0063674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.961e-01 8.377e-05 11890.853 < 2e-16 ***
## poly(pH, 2)1 -1.523e-02 2.057e-03 -7.400 2.19e-13 ***
## poly(pH, 2)2 -2.767e-03 1.733e-03 -1.596 0.111
## citric.acid 2.501e-03 2.649e-04 9.443 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001727 on 1595 degrees of freedom
## Multiple R-squared: 0.1639, Adjusted R-squared: 0.1623
## F-statistic: 104.2 on 3 and 1595 DF, p-value: < 2.2e-16
quadratic_model_one_rw_5 <- lm(density ~ I(pH**2) + citric.acid, data = red_wine)
summary(quadratic_model_one_rw_5)
##
## Call:
## lm(formula = density ~ I(pH^2) + citric.acid, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0079131 -0.0009359 -0.0000066 0.0010102 0.0063697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 5.916e-04 1690.752 < 2e-16 ***
## I(pH^2) -3.766e-04 4.992e-05 -7.543 7.64e-14 ***
## citric.acid 2.465e-03 2.634e-04 9.361 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001728 on 1596 degrees of freedom
## Multiple R-squared: 0.163, Adjusted R-squared: 0.162
## F-statistic: 155.4 on 2 and 1596 DF, p-value: < 2.2e-16
## extract top results
anova(quadratic_model_one_rw_2, quadratic_model_one_rw_3)
## Analysis of Variance Table
##
## Model 1: density ~ pH + I(citric.acid^2)
## Model 2: density ~ poly(pH, 2) + poly(citric.acid, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1596 0.0047390
## 2 1594 0.0047197 2 1.927e-05 3.254 0.03887 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA result (p ≈ 0.039) favours model number three, the one with a second-degree polynomial for each predictor.
Subsequently, another five models are created. After careful consideration, model number one and model number three were compared with ANOVA.
quadratic_model_two_rw_1 <- lm(density ~ pH + poly(fixed.acidity, 2), data = red_wine)
summary(quadratic_model_two_rw_1)
##
## Call:
## lm(formula = density ~ pH + poly(fixed.acidity, 2), data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0051767 -0.0007582 0.0001339 0.0008884 0.0054965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9862880 0.0010327 955.088 < 2e-16 ***
## pH 0.0031587 0.0003117 10.133 < 2e-16 ***
## poly(fixed.acidity, 2)1 0.0637156 0.0018882 33.744 < 2e-16 ***
## poly(fixed.acidity, 2)2 -0.0091841 0.0014052 -6.536 8.49e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001356 on 1595 degrees of freedom
## Multiple R-squared: 0.4847, Adjusted R-squared: 0.4837
## F-statistic: 500.1 on 3 and 1595 DF, p-value: < 2.2e-16
quadratic_model_two_rw_2 <- lm(density ~ pH + I(fixed.acidity**2), data = red_wine)
summary(quadratic_model_two_rw_2)
##
## Call:
## lm(formula = density ~ pH + I(fixed.acidity^2), data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0064315 -0.0008023 0.0001151 0.0009289 0.0057868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.879e-01 1.073e-03 921.048 < 2e-16 ***
## pH 1.733e-03 3.023e-04 5.733 1.18e-08 ***
## I(fixed.acidity^2) 4.352e-05 1.447e-06 30.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001418 on 1596 degrees of freedom
## Multiple R-squared: 0.4363, Adjusted R-squared: 0.4356
## F-statistic: 617.6 on 2 and 1596 DF, p-value: < 2.2e-16
quadratic_model_two_rw_3 <- lm(density ~ poly(pH, 2) + poly(fixed.acidity, 2), data = red_wine)
summary(quadratic_model_two_rw_3)
##
## Call:
## lm(formula = density ~ poly(pH, 2) + poly(fixed.acidity, 2),
## data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0049036 -0.0007725 0.0001257 0.0008749 0.0055040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.967e-01 3.389e-05 29408.461 < 2e-16 ***
## poly(pH, 2)1 1.932e-02 1.925e-03 10.035 < 2e-16 ***
## poly(pH, 2)2 -2.567e-03 1.513e-03 -1.696 0.09 .
## poly(fixed.acidity, 2)1 6.378e-02 1.888e-03 33.793 < 2e-16 ***
## poly(fixed.acidity, 2)2 -8.032e-03 1.560e-03 -5.149 2.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001355 on 1594 degrees of freedom
## Multiple R-squared: 0.4856, Adjusted R-squared: 0.4843
## F-statistic: 376.2 on 4 and 1594 DF, p-value: < 2.2e-16
quadratic_model_two_rw_4 <- lm(density ~ poly(pH, 2) + fixed.acidity, data = red_wine)
summary(quadratic_model_two_rw_4)
##
## Call:
## lm(formula = density ~ poly(pH, 2) + fixed.acidity, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0046708 -0.0007866 0.0001228 0.0008954 0.0056065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.893e-01 2.273e-04 4353.036 < 2e-16 ***
## poly(pH, 2)1 1.676e-02 1.875e-03 8.939 < 2e-16 ***
## poly(pH, 2)2 -5.958e-03 1.373e-03 -4.340 1.52e-05 ***
## fixed.acidity 8.948e-04 2.701e-05 33.134 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001366 on 1595 degrees of freedom
## Multiple R-squared: 0.4771, Adjusted R-squared: 0.4761
## F-statistic: 485 on 3 and 1595 DF, p-value: < 2.2e-16
quadratic_model_two_rw_5 <- lm(density ~ I(pH**2) + fixed.acidity, data = red_wine)
summary(quadratic_model_two_rw_5)
##
## Call:
## lm(formula = density ~ I(pH^2) + fixed.acidity, data = red_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0053901 -0.0007668 0.0001537 0.0009244 0.0056277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.853e-01 6.750e-04 1459.565 <2e-16 ***
## I(pH^2) 3.818e-04 4.564e-05 8.366 <2e-16 ***
## fixed.acidity 8.773e-04 2.694e-05 32.565 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001375 on 1596 degrees of freedom
## Multiple R-squared: 0.4695, Adjusted R-squared: 0.4689
## F-statistic: 706.4 on 2 and 1596 DF, p-value: < 2.2e-16
## extract top results
anova(quadratic_model_two_rw_1, quadratic_model_two_rw_3)
## Analysis of Variance Table
##
## Model 1: density ~ pH + poly(fixed.acidity, 2)
## Model 2: density ~ poly(pH, 2) + poly(fixed.acidity, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1595 0.0029332
## 2 1594 0.0029279 1 5.2855e-06 2.8775 0.09002 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the end, model number one was selected, since there was no evidence to reject the null hypothesis (the p-value of the F-statistic is around 9%).
Below, the reader can have a look at the graphs representing the models above. In the first graph, both the x- and y-axis values are squared, while in the second one only the x-axis (fixed acidity) is squared, following the models selected above.
quadratic_plot_3d_one_rw <- plot_ly(data = red_wine, z= ~density, y= ~pH**2, x= ~citric.acid**2,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
quadratic_plot_3d_one_rw <- quadratic_plot_3d_one_rw %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Citric Acid Concentration (squared)'),
yaxis = list(title = 'pH (squared)'),
zaxis = list(title = 'Density')))
quadratic_plot_3d_one_rw
Both Predictors Squared, 3D Scatterplot, Model 1
quadratic_plot_3d_two_rw <- plot_ly(data = red_wine, z= ~density, y= ~pH, x= ~fixed.acidity**2,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
quadratic_plot_3d_two_rw <- quadratic_plot_3d_two_rw %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Fixed Acidity (squared)'),
yaxis = list(title = 'pH'),
zaxis = list(title = 'Density')))
quadratic_plot_3d_two_rw
Fixed Acidity Predictor Squared, 3D Scatterplot, Model 2
For the sake of completeness, the reader can have a look at a GAM fit for each pair of predictors discussed above.
A GAM (Generalised Additive Model) is a generalised linear model in which the response depends linearly on unknown smooth functions (s( )) of the predictors. In the output, the reader can inspect the edf column (effective degrees of freedom): its values represent the complexity of each smooth (for example, an edf of 1 corresponds to a straight line, an edf of 2 to a quadratic curve, and so on). The “Ref.df” and “F” columns contain approximate tests of the overall significance of each smooth, and the p-value column shows the significance of the smooth terms.
A further piece of information is the GAM distribution family and link function. Here the link function is the identity and the family is Gaussian, i.e. the model assumes normally distributed errors. This assumption is important, but it holds only approximately for our data.
For these reasons, the aim of this part is only to present a more flexible modelling method; readers can decide whether to explore it further on their own. In conclusion, this method will not be used in the final answer to our anonymous customer.
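To make the edf idea concrete, here is a self-contained sketch on synthetic data (not the wine dataset): a clearly non-linear signal should receive an edf well above 1, and `summary()` exposes the same quantities discussed above.

```r
library(mgcv)  # already loaded in this project

set.seed(7)
d <- data.frame(x = runif(300))
d$y <- sin(2 * pi * d$x) + rnorm(300, sd = 0.2)  # strongly non-linear signal

g <- gam(y ~ s(x), data = d)
s <- summary(g)
s$edf      # effective degrees of freedom of s(x): well above 1 here
s$s.table  # edf, Ref.df, F and p-value for each smooth term
# gam.check(g) additionally produces residual diagnostic plots
```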
gam_model_one_rw <- gam(density ~ s(pH) + s(citric.acid), data = red_wine)
summary(gam_model_one_rw)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## density ~ s(pH) + s(citric.acid)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.967e-01 4.253e-05 23436 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(pH) 5.15 6.179 12.09 <2e-16 ***
## s(citric.acid) 7.10 8.081 15.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.188 Deviance explained = 19.4%
## GCV = 2.9165e-06 Scale est. = 2.8923e-06 n = 1599
plot(gam_model_one_rw, residuals = TRUE, cex = 2)
gam_model_one_rw <- gam(density ~ s(pH) + s(fixed.acidity), data = red_wine)
summary(gam_model_one_rw)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## density ~ s(pH) + s(fixed.acidity)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.967e-01 3.278e-05 30403 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(pH) 7.769 8.544 17.87 <2e-16 ***
## s(fixed.acidity) 7.172 8.123 159.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.518 Deviance explained = 52.2%
## GCV = 1.736e-06 Scale est. = 1.7187e-06 n = 1599
plot(gam_model_one_rw, residuals = TRUE, cex = 2)
White wine
In this paragraph, the same procedure applied above to red wines is repeated on the white wine records. As before, the aim is to keep the model simple, since an overly complex model can be misleading. Therefore, using the functions I( ) and poly( ), five different models are tested.
In the first part, the ANOVA comparison (p ≈ 0.12) gives no evidence that the quadratic term on total sulphur dioxide improves the fit, so the simpler model number four is preferred: it keeps the quadratic polynomial only on residual sugar concentration. Its adjusted R² is around 76% and all its coefficients are highly significant.
quadratic_model_one_ww_1 <- lm(density ~ residual.sugar + poly(total.sulfur.dioxide, 2), data = white_wine)
summary(quadratic_model_one_ww_1) #.7482
##
## Call:
## lm(formula = density ~ residual.sugar + poly(total.sulfur.dioxide,
## 2), data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0053889 -0.0009573 0.0000479 0.0009827 0.0184489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.912e-01 3.650e-05 27153.544 <2e-16 ***
## residual.sugar 4.399e-04 4.622e-06 95.181 <2e-16 ***
## poly(total.sulfur.dioxide, 2)1 4.822e-02 1.639e-03 29.422 <2e-16 ***
## poly(total.sulfur.dioxide, 2)2 -1.921e-03 1.503e-03 -1.278 0.201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001501 on 4894 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.7482
## F-statistic: 4852 on 3 and 4894 DF, p-value: < 2.2e-16
quadratic_model_one_ww_2 <- lm(density ~ residual.sugar + I(total.sulfur.dioxide**2), data = white_wine)
summary(quadratic_model_one_ww_2) #.7448
##
## Call:
## lm(formula = density ~ residual.sugar + I(total.sulfur.dioxide^2),
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0081908 -0.0010078 0.0000477 0.0010316 0.0182535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.901e-01 4.403e-05 22486.45 <2e-16 ***
## residual.sugar 4.454e-04 4.605e-06 96.72 <2e-16 ***
## I(total.sulfur.dioxide^2) 5.131e-08 1.828e-09 28.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001511 on 4895 degrees of freedom
## Multiple R-squared: 0.7449, Adjusted R-squared: 0.7448
## F-statistic: 7147 on 2 and 4895 DF, p-value: < 2.2e-16
quadratic_model_one_ww_3 <- lm(density ~ poly(residual.sugar, 2) + poly(total.sulfur.dioxide, 2), data = white_wine)
summary(quadratic_model_one_ww_3) #.7582
##
## Call:
## lm(formula = density ~ poly(residual.sugar, 2) + poly(total.sulfur.dioxide,
## 2), data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0054382 -0.0009721 0.0000652 0.0010157 0.0052606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.940e-01 2.101e-05 47303.545 <2e-16 ***
## poly(residual.sugar, 2)1 1.550e-01 1.610e-03 96.268 <2e-16 ***
## poly(residual.sugar, 2)2 2.112e-02 1.482e-03 14.248 <2e-16 ***
## poly(total.sulfur.dioxide, 2)1 5.110e-02 1.619e-03 31.570 <2e-16 ***
## poly(total.sulfur.dioxide, 2)2 -2.274e-03 1.473e-03 -1.544 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001471 on 4893 degrees of freedom
## Multiple R-squared: 0.7584, Adjusted R-squared: 0.7582
## F-statistic: 3840 on 4 and 4893 DF, p-value: < 2.2e-16
quadratic_model_one_ww_4 <- lm(density ~ poly(residual.sugar, 2) + total.sulfur.dioxide, data = white_wine)
summary(quadratic_model_one_ww_4) #.7582
##
## Call:
## lm(formula = density ~ poly(residual.sugar, 2) + total.sulfur.dioxide,
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0056644 -0.0009748 0.0000625 0.0010214 0.0052809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.917e-01 7.818e-05 12683.82 <2e-16 ***
## poly(residual.sugar, 2)1 1.551e-01 1.608e-03 96.46 <2e-16 ***
## poly(residual.sugar, 2)2 2.108e-02 1.482e-03 14.22 <2e-16 ***
## total.sulfur.dioxide 1.717e-05 5.443e-07 31.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001471 on 4894 degrees of freedom
## Multiple R-squared: 0.7583, Adjusted R-squared: 0.7582
## F-statistic: 5118 on 3 and 4894 DF, p-value: < 2.2e-16
quadratic_model_one_ww_5 <- lm(density ~ I(residual.sugar**2) + total.sulfur.dioxide, data = white_wine)
summary(quadratic_model_one_ww_5) #.6859
##
## Call:
## lm(formula = density ~ I(residual.sugar^2) + total.sulfur.dioxide,
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.034974 -0.001132 0.000134 0.001169 0.004981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.895e-01 8.198e-05 12070.53 <2e-16 ***
## I(residual.sugar^2) 1.863e-05 2.343e-07 79.49 <2e-16 ***
## total.sulfur.dioxide 2.368e-05 5.890e-07 40.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001676 on 4895 degrees of freedom
## Multiple R-squared: 0.6861, Adjusted R-squared: 0.6859
## F-statistic: 5348 on 2 and 4895 DF, p-value: < 2.2e-16
## extract top results
anova(quadratic_model_one_ww_3, quadratic_model_one_ww_4)
## Analysis of Variance Table
##
## Model 1: density ~ poly(residual.sugar, 2) + poly(total.sulfur.dioxide,
## 2)
## Model 2: density ~ poly(residual.sugar, 2) + total.sulfur.dioxide
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4893 0.010583
## 2 4894 0.010588 -1 -5.1552e-06 2.3835 0.1227
In the second chunk of code, the resulting model is number four, which applies the quadratic polynomial only to residual sugar concentration. The results are strong: the adjusted R² is around 92% and, looking at the p-values, all the predictors appear relevant to the model.
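Once a model of this form is chosen, estimating the density of new samples is a single `predict()` call. The sketch below uses synthetic data with illustrative coefficients (not the fitted values above); note that `poly( )` stores its basis in the model object, so `predict()` applies the same transformation to the new data automatically.

```r
set.seed(3)
d <- data.frame(residual.sugar = runif(300, 1, 20),
                alcohol = runif(300, 8, 14))
d$density <- 1.007 + 3e-4 * d$residual.sugar + 1e-5 * d$residual.sugar^2 -
  1.25e-3 * d$alcohol + rnorm(300, sd = 8e-4)

m <- lm(density ~ poly(residual.sugar, 2) + alcohol, data = d)

# Two hypothetical new wines (values are illustrative).
new_wines <- data.frame(residual.sugar = c(2.0, 12.5),
                        alcohol = c(11.0, 9.5))
predict(m, newdata = new_wines)  # estimated densities for the new samples
```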
quadratic_model_two_ww_1 <- lm(density ~ residual.sugar + poly(alcohol, 2), data = white_wine)
summary(quadratic_model_two_ww_1) #.908
##
## Call:
## lm(formula = density ~ residual.sugar + poly(alcohol, 2), data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0020934 -0.0005749 -0.0001250 0.0004479 0.0245036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.917e-01 2.304e-05 43048.483 <2e-16 ***
## residual.sugar 3.674e-04 2.979e-06 123.340 <2e-16 ***
## poly(alcohol, 2)1 -1.045e-01 1.025e-03 -101.969 <2e-16 ***
## poly(alcohol, 2)2 -7.809e-03 9.440e-04 -8.273 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009074 on 4894 degrees of freedom
## Multiple R-squared: 0.908, Adjusted R-squared: 0.908
## F-statistic: 1.61e+04 on 3 and 4894 DF, p-value: < 2.2e-16
quadratic_model_two_ww_2 <- lm(density ~ residual.sugar + I(alcohol**2), data = white_wine)
summary(quadratic_model_two_ww_2) #.9079
##
## Call:
## lm(formula = density ~ residual.sugar + I(alcohol^2), data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0020784 -0.0005754 -0.0001259 0.0004455 0.0246148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.980e-01 7.157e-05 13943.9 <2e-16 ***
## residual.sugar 3.659e-04 2.841e-06 128.8 <2e-16 ***
## I(alcohol^2) -5.630e-05 5.404e-07 -104.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009076 on 4895 degrees of freedom
## Multiple R-squared: 0.9079, Adjusted R-squared: 0.9079
## F-statistic: 2.414e+04 on 2 and 4895 DF, p-value: < 2.2e-16
quadratic_model_two_ww_3 <- lm(density ~ poly(residual.sugar, 2) + poly(alcohol, 2), data = white_wine)
summary(quadratic_model_two_ww_3) #.9195
##
## Call:
## lm(formula = density ~ poly(residual.sugar, 2) + poly(alcohol,
## 2), data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0023291 -0.0005908 -0.0001010 0.0004421 0.0054783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.940e-01 1.212e-05 81990.840 <2e-16 ***
## poly(residual.sugar, 2)1 1.294e-01 9.895e-04 130.736 <2e-16 ***
## poly(residual.sugar, 2)2 2.260e-02 8.513e-04 26.544 <2e-16 ***
## poly(alcohol, 2)1 -1.066e-01 9.615e-04 -110.834 <2e-16 ***
## poly(alcohol, 2)2 -7.282e-03 8.829e-04 -8.248 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0008485 on 4893 degrees of freedom
## Multiple R-squared: 0.9196, Adjusted R-squared: 0.9195
## F-statistic: 1.399e+04 on 4 and 4893 DF, p-value: < 2.2e-16
quadratic_model_two_ww_4 <- lm(density ~ poly(residual.sugar, 2) + alcohol, data = white_wine)
summary(quadratic_model_two_ww_4) #.9184
##
## Call:
## lm(formula = density ~ poly(residual.sugar, 2) + alcohol, data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0022620 -0.0005973 -0.0001115 0.0004628 0.0057897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.007e+00 1.178e-04 8547.72 <2e-16 ***
## poly(residual.sugar, 2)1 1.271e-01 9.575e-04 132.75 <2e-16 ***
## poly(residual.sugar, 2)2 2.276e-02 8.569e-04 26.55 <2e-16 ***
## alcohol -1.249e-03 1.115e-05 -112.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0008543 on 4894 degrees of freedom
## Multiple R-squared: 0.9185, Adjusted R-squared: 0.9184
## F-statistic: 1.838e+04 on 3 and 4894 DF, p-value: < 2.2e-16
quadratic_model_two_ww_5 <- lm(density ~ I(residual.sugar**2) + alcohol, data = white_wine)
summary(quadratic_model_two_ww_5) #.8787
##
## Call:
## lm(formula = density ~ I(residual.sugar^2) + alcohol, data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0197420 -0.0006541 -0.0000540 0.0006093 0.0048127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.008e+00 1.407e-04 7163.9 <2e-16 ***
## I(residual.sugar^2) 1.557e-05 1.491e-07 104.4 <2e-16 ***
## alcohol -1.416e-03 1.294e-05 -109.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001042 on 4895 degrees of freedom
## Multiple R-squared: 0.8788, Adjusted R-squared: 0.8787
## F-statistic: 1.774e+04 on 2 and 4895 DF, p-value: < 2.2e-16
## extract top results
anova(quadratic_model_two_ww_3, quadratic_model_two_ww_4)
## Analysis of Variance Table
##
## Model 1: density ~ poly(residual.sugar, 2) + poly(alcohol, 2)
## Model 2: density ~ poly(residual.sugar, 2) + alcohol
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4893 0.0035226
## 2 4894 0.0035716 -1 -4.898e-05 68.035 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Below, the reader can see two different graphs that visually represent the models from above.
quadratic_plot_one_ww <- plot_ly(data = white_wine, z= ~density, y= ~residual.sugar**2, x= ~total.sulfur.dioxide**2,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
quadratic_plot_one_ww <- quadratic_plot_one_ww %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Total Sulphur Dioxide Concentration squared'),
yaxis = list(title = 'Residual Sugar Concentration squared'),
zaxis = list(title = 'Density')))
quadratic_plot_one_ww
Predictors squared Scatterplot 3D, Model 1
quadratic_plot_two_ww <- plot_ly(data = white_wine, z= ~density, y= ~residual.sugar**2, x= ~alcohol,
color= ~quality,
colors=c('#ffffb2','#fed976','#feb24c','#fd8d3c','#f03b20','#bd0026'),
size = 0.2)
quadratic_plot_two_ww <- quadratic_plot_two_ww %>% add_markers() %>% layout(scene = list(xaxis = list(title = 'Alcohol Degree'),
yaxis = list(title = 'Residual Sugar Concentration squared'),
zaxis = list(title = 'Density')))
quadratic_plot_two_ww
Residual Sugar predictor squared Scatterplot 3D, Model 2
Also in this case, the GAMs were created for completeness. These models give a technical audience a tool to interpret and understand the results; other readers who are interested can consult the documentation referenced at the end of the document. However, as specified previously, these models are not going to be considered when answering the problem's question.
gam_model_one_ww <- gam(density ~ s(residual.sugar) + s(total.sulfur.dioxide), data = white_wine)
summary(gam_model_one_ww)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## density ~ s(residual.sugar) + s(total.sulfur.dioxide)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.940e-01 2.051e-05 48466 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(residual.sugar) 8.944 8.999 1116.5 <2e-16 ***
## s(total.sulfur.dioxide) 5.151 6.259 167.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.77 Deviance explained = 77%
## GCV = 2.0667e-06 Scale est. = 2.0603e-06 n = 4898
plot(gam_model_one_ww, residuals = TRUE, cex = 2)
gam_model_two_ww <- gam(density ~ s(residual.sugar) + s(alcohol), data = white_wine)
summary(gam_model_two_ww)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## density ~ s(residual.sugar) + s(alcohol)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.94e-01 1.16e-05 85705 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(residual.sugar) 8.291 8.818 2274 <2e-16 ***
## s(alcohol) 6.931 7.946 1726 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.926 Deviance explained = 92.7%
## GCV = 6.6106e-07 Scale est. = 6.5887e-07 n = 4898
plot(gam_model_two_ww, residuals = TRUE, cex = 2)
In this case, it is interesting to have a look at the results. In particular, for the second GAM model, the summary indicates that the estimated degrees of freedom for alcohol are around 7. However, the graph makes it clear that a degree of 7 would overfit data that are essentially distributed along a line (degree 1, as in the resulting model from above). In addition, the residual sugar term seems to follow a parabolic shape. Again, this is in line with the results obtained using poly.
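The observation above can be checked directly: if the alcohol effect is essentially linear, a GAM with a smooth on residual sugar and a plain linear term for alcohol should fit almost as well. A minimal sketch of that comparison, assuming the white_wine data frame and the mgcv package loaded above:

```r
# Hedged check: replace s(alcohol) with a linear alcohol term and
# compare the two fits. A small AIC difference would confirm that
# the high edf estimated for alcohol is not really needed.
gam_model_alc_linear_ww <- gam(density ~ s(residual.sugar) + alcohol,
                               data = white_wine)
AIC(gam_model_two_ww, gam_model_alc_linear_ww)
```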
4.3 GLM
GLM stands for Generalised Linear Model. This is an advanced statistical modelling technique as well as a flexible generalisation of ordinary linear regression. It is an umbrella term that encompasses many other models.
In this project, the GLM method employed is the multinomial logistic regression. Through this method, it is possible to create a model to predict and estimate wine quality, starting from physicochemical properties. Therefore, this model should be considered when answering the second question of the project.
To be more precise, a multinomial logistic regression is a classification method that generalises logistic regression to multiclass problems. In this project, the aim is to classify wine quality, given a set of independent variables. These variables are the physicochemical properties of the wine.
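As a brief illustration of the mechanics, a multinomial logistic model assigns each class a linear score and converts the scores into class probabilities through a softmax function. The sketch below is independent of the wine data, and the scores are hypothetical:

```r
# Softmax: P(class k) = exp(z_k) / sum_j exp(z_j).
# nnet::multinom exposes the same probabilities via
# predict(model, newdata, type = "probs").
softmax <- function(z) exp(z) / sum(exp(z))
softmax(c(1.2, 0.4, -0.8))  # three probabilities summing to 1
```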
Red wine
In this section, the reader can see the process of model creation. As mentioned above, the first examined model is the multinomial logistic regression.
It is important to specify that, in this case, the data partition created is also used in the next steps, for the SVM model and the ANN. In addition, some precautions have to be specified:
to enhance the experiment's reproducibility, it is necessary to call set.seed() with the same values as in the code.
the percentage of training data is 75%, while the remaining 25% constitutes the test datasets (the test inputs and the corresponding true labels).
the quality values have to be treated as a “factor” in R; treating an ordinal class label as “numeric” would be nonsensical.
Before going through the results, it is important to specify how to measure the significance of a model:
Accuracy: accuracy is the fraction of predictions our model got right.
Precision: precision attempts to answer the following question: What proportion of identifications was correct? The calculation is precision = (true positive)/(true positive + false positive)
Recall: recall attempts to answer the following question: What proportion of actual positives was identified correctly? The calculation is recall = (true positive)/(true positive + false negative)
F1 Score: the F1 score is a machine learning evaluation metric that combines the precision and recall scores of a model into a single number, as their harmonic mean: F1 = 2 × (precision × recall)/(precision + recall).
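To make these definitions concrete, the sketch below computes the same quantities directly from a confusion matrix whose rows are predictions and whose columns are the reference classes, mirroring the layout reported by caret::confusionMatrix(). The helper name is illustrative:

```r
# Illustrative helper: accuracy plus per-class precision, recall and F1
# from a confusion matrix (rows = predictions, columns = reference).
class_metrics <- function(cm) {
  tp <- diag(cm)
  precision <- tp / rowSums(cm)   # TP / (TP + FP)
  recall    <- tp / colSums(cm)   # TP / (TP + FN)
  f1        <- 2 * precision * recall / (precision + recall)
  list(accuracy = sum(tp) / sum(cm),
       precision = precision, recall = recall, f1 = f1)
}
```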
red_wine[, "quality"] <- as.factor(red_wine[, "quality"])
set.seed(123)
indices <- createDataPartition(red_wine$quality, p = .75, list = F)
train_rw <- red_wine %>% slice(indices)
test_in_rw <- red_wine %>% slice(-indices) %>% select(-quality)
test_truth_rw <- red_wine %>% slice(-indices) %>% pull(quality)
multinomial_logistic_model_rw <- multinom(quality~ ., data = train_rw)
## # weights: 39 (24 variable)
## initial value 1318.334746
## iter 10 value 1117.539397
## iter 20 value 912.142068
## iter 30 value 904.536614
## iter 40 value 903.054998
## iter 50 value 901.876993
## final value 901.876532
## converged
prediction_multinomial_model_rw <- predict(multinomial_logistic_model_rw, test_in_rw)
testing_rw_predictions <- confusionMatrix(prediction_multinomial_model_rw, test_truth_rw, mode = "everything")
testing_rw_predictions
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 140 62 5
## 2 42 87 27
## 3 4 10 22
##
## Overall Statistics
##
## Accuracy : 0.6241
## 95% CI : (0.5745, 0.6718)
## No Information Rate : 0.4662
## P-Value [Acc > NIR] : 1.7e-10
##
## Kappa : 0.363
##
## Mcnemar's Test P-Value : 0.008221
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7527 0.5472 0.40741
## Specificity 0.6854 0.7125 0.95942
## Pos Pred Value 0.6763 0.5577 0.61111
## Neg Pred Value 0.7604 0.7037 0.91185
## Precision 0.6763 0.5577 0.61111
## Recall 0.7527 0.5472 0.40741
## F1 0.7125 0.5524 0.48889
## Prevalence 0.4662 0.3985 0.13534
## Detection Rate 0.3509 0.2180 0.05514
## Detection Prevalence 0.5188 0.3910 0.09023
## Balanced Accuracy 0.7191 0.6298 0.68341
The result shows an accuracy of around 62%. Looking at the precision and recall, it is clear that the model performs reasonably well only for the first class, where the F1 score is around 71%. However, both the second and third classes present F1 scores of around 50%. To summarise, the model will be taken into account in the final discussion, even though its performance is not excellent.
White wine
In the script below, a multinomial logistic regression is applied to the dataset containing white wine data.
As before, the data partition will be reused in the following sections dealing with the white wine dataset.
white_wine[, "quality"] <- as.factor(white_wine[, "quality"])
set.seed(123)
indices <- createDataPartition(white_wine$quality, p = .75, list = F)
train_ww <- white_wine %>% slice(indices)
test_in_ww <- white_wine %>% slice(-indices) %>% select(-quality)
test_truth_ww <- white_wine %>% slice(-indices) %>% pull(quality)
multinomial_logistic_model_ww <- multinom(quality~ ., data = train_ww)
## # weights: 39 (24 variable)
## initial value 4036.301549
## iter 10 value 3693.440521
## iter 20 value 3257.242399
## iter 30 value 3244.930201
## iter 40 value 3230.541594
## iter 50 value 3224.604110
## iter 60 value 3224.103912
## iter 70 value 3223.787127
## iter 80 value 3223.740612
## iter 90 value 3223.425391
## final value 3223.233624
## converged
prediction_multinomial_model_ww <- predict(multinomial_logistic_model_ww, test_in_ww)
testing_ww_predictions <- confusionMatrix(prediction_multinomial_model_ww, test_truth_ww, mode = "everything")
testing_ww_predictions
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 232 128 20
## 2 174 367 151
## 3 4 54 94
##
## Overall Statistics
##
## Accuracy : 0.5662
## 95% CI : (0.5379, 0.5942)
## No Information Rate : 0.4485
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2952
##
## Mcnemar's Test P-Value : 1.014e-13
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.5659 0.6685 0.3547
## Specificity 0.8182 0.5185 0.9395
## Pos Pred Value 0.6105 0.5303 0.6184
## Neg Pred Value 0.7891 0.6579 0.8405
## Precision 0.6105 0.5303 0.6184
## Recall 0.5659 0.6685 0.3547
## F1 0.5873 0.5915 0.4508
## Prevalence 0.3350 0.4485 0.2165
## Detection Rate 0.1895 0.2998 0.0768
## Detection Prevalence 0.3105 0.5654 0.1242
## Balanced Accuracy 0.6920 0.5935 0.6471
In this case, the results are not very significant. In particular, the accuracy is around 57% and the F1 scores are lower than 60% for every class. The worst result is in the third class, where the F1 score is around 45%.
4.4 SVM
Support Vector Machines (SVM) are supervised learning models with associated learning algorithms to analyse data for classification and regression. This classifier can be used to perform linear classification, as well as non-linear classification using the kernel trick, mapping the inputs into high-dimensional feature spaces.
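For intuition, the radial (RBF) kernel used later in this section measures the similarity between two observations as K(x, y) = exp(-gamma * ||x - y||^2). A minimal, self-contained sketch:

```r
# RBF kernel: equals 1 for identical points and decays towards 0
# as the squared Euclidean distance between x and y grows.
rbf_kernel <- function(x, y, gamma = 0.5) {
  exp(-gamma * sum((x - y)^2))
}
rbf_kernel(c(1, 2), c(1, 2))  # identical points give 1
```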
In this project, SVM is used to answer our anonymous customer's second question. In particular, the algorithm will try to predict wine quality, using as independent variables all the other physicochemical properties of the wine.
Concerning the implementation, the hyperparameters of the model were tuned using the tune() function. Afterwards, the model is trained and tested on the data partition created in the previous section.
# tune_out_linear <- tune(svm, quality~., data = train_rw, kernel = "linear",
# ranges = list(cost = c(0.1,1,10,30)))
# tune_out_linear$best.model
#
# tune_out_radial <- tune(svm, quality~., data = train_rw, kernel = "radial",
# ranges = list(cost = c(0.1,1,10),
# gamma = c(0.5,1,2,3,4)))
# tune_out_radial$best.model
# tune_out_radial$best.model$gamma
# create the svm model (LINEAR)
red_wine_svm_linear_rw <- svm(quality~ ., train_rw, kernel = "linear",
scale = TRUE, cost = 0.1)
test_pred_linear_rw <- predict(red_wine_svm_linear_rw, test_in_rw)
confusion_matrix_linear_rw <- confusionMatrix(test_pred_linear_rw, test_truth_rw, mode = "everything")
confusion_matrix_linear_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 140 66 5
## 2 46 93 49
## 3 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.584
## 95% CI : (0.5339, 0.6328)
## No Information Rate : 0.4662
## P-Value [Acc > NIR] : 1.556e-06
##
## Kappa : 0.2646
##
## Mcnemar's Test P-Value : 1.941e-12
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7527 0.5849 0.0000
## Specificity 0.6667 0.6042 1.0000
## Pos Pred Value 0.6635 0.4947 NaN
## Neg Pred Value 0.7553 0.6872 0.8647
## Precision 0.6635 0.4947 NA
## Recall 0.7527 0.5849 0.0000
## F1 0.7053 0.5360 NA
## Prevalence 0.4662 0.3985 0.1353
## Detection Rate 0.3509 0.2331 0.0000
## Detection Prevalence 0.5288 0.4712 0.0000
## Balanced Accuracy 0.7097 0.5945 0.5000
## Second model, RADIAL
red_wine_svm_radial_rw <- svm(quality~ ., train_rw, kernel = "radial", gamma = 0.5,
scale = TRUE, cost = 1)
test_pred_radial_rw <- predict(red_wine_svm_radial_rw, test_in_rw)
confusion_matrix_radial_rw <- confusionMatrix(test_pred_radial_rw, test_truth_rw, mode = "everything")
confusion_matrix_radial_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 156 52 7
## 2 29 105 26
## 3 1 2 21
##
## Overall Statistics
##
## Accuracy : 0.7068
## 95% CI : (0.6594, 0.751)
## No Information Rate : 0.4662
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4952
##
## Mcnemar's Test P-Value : 6.347e-07
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.8387 0.6604 0.38889
## Specificity 0.7230 0.7708 0.99130
## Pos Pred Value 0.7256 0.6562 0.87500
## Neg Pred Value 0.8370 0.7741 0.91200
## Precision 0.7256 0.6562 0.87500
## Recall 0.8387 0.6604 0.38889
## F1 0.7781 0.6583 0.53846
## Prevalence 0.4662 0.3985 0.13534
## Detection Rate 0.3910 0.2632 0.05263
## Detection Prevalence 0.5388 0.4010 0.06015
## Balanced Accuracy 0.7809 0.7156 0.69010
The result for the linear kernel SVM has to be discarded. Although the model has an accuracy of almost 60%, it never predicts the third class; therefore, an F1 score cannot even be computed for every category.
On the other hand, the radial kernel SVM is considerably more accurate. The accuracy is around 70%, and the F1 score for the first class is almost 78%. As before, the third class presents an F1 score lower than expected; however, in this case, at least the precision is quite high (87.5%).
White wine
The process is iterated for the white wines. The model hyperparameters are tuned using the function tune(). Afterwards, two models are built: the first with a linear kernel, the second with a radial kernel.
# tune_out_linear <- tune(svm, quality~., data = train_ww, kernel = "linear",
# ranges = list(cost = c(0.1,1,10,30)))
# tune_out_linear$best.model
#
# tune_out_radial <- tune(svm, quality~., data = train_ww, kernel = "radial",
# ranges = list(cost = c(0.1,1,10),
# gamma = c(0.5,1,2,3,4)))
# tune_out_radial$best.model
# tune_out_radial$best.model$gamma
# create the svm model (LINEAR)
white_wine_svm_linear_ww <- svm(quality~ ., train_ww, kernel = "linear",
scale = TRUE, cost = 30)
test_pred_linear_ww <- predict(white_wine_svm_linear_ww, test_in_ww)
confusion_matrix_linear_ww <- confusionMatrix(test_pred_linear_ww, test_truth_ww, mode = "everything")
confusion_matrix_linear_ww
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 235 135 17
## 2 175 414 248
## 3 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5302
## 95% CI : (0.5018, 0.5585)
## No Information Rate : 0.4485
## P-Value [Acc > NIR] : 6.003e-09
##
## Kappa : 0.2002
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.5732 0.7541 0.0000
## Specificity 0.8133 0.3733 1.0000
## Pos Pred Value 0.6072 0.4946 NaN
## Neg Pred Value 0.7909 0.6512 0.7835
## Precision 0.6072 0.4946 NA
## Recall 0.5732 0.7541 0.0000
## F1 0.5897 0.5974 NA
## Prevalence 0.3350 0.4485 0.2165
## Detection Rate 0.1920 0.3382 0.0000
## Detection Prevalence 0.3162 0.6838 0.0000
## Balanced Accuracy 0.6932 0.5637 0.5000
## Second model, RADIAL
white_wine_svm_radial_ww <- svm(quality~ ., train_ww, kernel = "radial",
scale = TRUE, cost = 10, gamma = 0.5)
test_pred_radial_ww <- predict(white_wine_svm_radial_ww, test_in_ww)
confusion_matrix_radial_ww <- confusionMatrix(test_pred_radial_ww, test_truth_ww, mode = "everything")
confusion_matrix_radial_ww
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 281 88 15
## 2 119 401 84
## 3 10 60 166
##
## Overall Statistics
##
## Accuracy : 0.6928
## 95% CI : (0.6661, 0.7186)
## No Information Rate : 0.4485
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5138
##
## Mcnemar's Test P-Value : 0.02186
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.6854 0.7304 0.6264
## Specificity 0.8735 0.6993 0.9270
## Pos Pred Value 0.7318 0.6639 0.7034
## Neg Pred Value 0.8464 0.7613 0.8998
## Precision 0.7318 0.6639 0.7034
## Recall 0.6854 0.7304 0.6264
## F1 0.7078 0.6956 0.6627
## Prevalence 0.3350 0.4485 0.2165
## Detection Rate 0.2296 0.3276 0.1356
## Detection Prevalence 0.3137 0.4935 0.1928
## Balanced Accuracy 0.7794 0.7148 0.7767
As before, the linear kernel SVM has to be discarded. The main reason is that no observation is predicted in the third class, which makes the model hard to interpret. Moreover, the resulting accuracy and F1 scores are not significant.
On the contrary, the radial kernel model seems stable: its accuracy and the F1 score of each class are all around 70%.
Cross Validation
In this subsection, a method to confirm the results is introduced. Cross Validation is a resampling method that uses different chunks of data to train and test a model over different iterations. In this case, the function crossv_kfold() is responsible for randomly splitting the data into k "train and test" samples. Afterwards, three models are created, one for each training dataset. Lastly, the key figures and parameters of these three new SVM models are compared to understand which model performed best. From the summaries and confusion matrices below, the reader can form an opinion on how much the results can vary for each category across folds.
This section aims to show the reader how to implement a cross-validation process. Therefore, only the red wine example is shown, with three folds.
## cross validation example
cross_validation_samples <- crossv_kfold(red_wine, k = 3)
train_cv_1 <- as.data.frame(cross_validation_samples$train$'1')
train_cv_2 <- as.data.frame(cross_validation_samples$train$'2')
train_cv_3 <- as.data.frame(cross_validation_samples$train$'3')
test_cv_1 <- as.data.frame(cross_validation_samples$test$'1')
test_cv_2 <- as.data.frame(cross_validation_samples$test$'2')
test_cv_3 <- as.data.frame(cross_validation_samples$test$'3')
#model 1
red_wine_svm_radial_rw <- svm(quality~ ., train_cv_1, kernel = "radial", gamma = 0.5,
scale = TRUE, cost = 10)
test_pred_radial_rw <- predict(red_wine_svm_radial_rw, test_cv_1[,1:11])
confusion_matrix_radial_rw <- confusionMatrix(test_pred_radial_rw, test_cv_1[,12], mode = "everything")
confusion_matrix_radial_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 191 77 13
## 2 52 120 29
## 3 2 16 33
##
## Overall Statistics
##
## Accuracy : 0.6454
## 95% CI : (0.6031, 0.6861)
## No Information Rate : 0.4597
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4025
##
## Mcnemar's Test P-Value : 0.0008273
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7796 0.5634 0.44000
## Specificity 0.6875 0.7469 0.96070
## Pos Pred Value 0.6797 0.5970 0.64706
## Neg Pred Value 0.7857 0.7199 0.91286
## Precision 0.6797 0.5970 0.64706
## Recall 0.7796 0.5634 0.44000
## F1 0.7262 0.5797 0.52381
## Prevalence 0.4597 0.3996 0.14071
## Detection Rate 0.3583 0.2251 0.06191
## Detection Prevalence 0.5272 0.3771 0.09568
## Balanced Accuracy 0.7335 0.6551 0.70035
#model 2
red_wine_svm_radial_rw <- svm(quality~ ., train_cv_2, kernel = "radial", gamma = 0.5,
scale = TRUE, cost = 10)
test_pred_radial_rw <- predict(red_wine_svm_radial_rw, test_cv_2[,1:11])
confusion_matrix_radial_rw <- confusionMatrix(test_pred_radial_rw, test_cv_2[,12], mode = "everything")
confusion_matrix_radial_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 187 68 12
## 2 58 128 27
## 3 1 14 38
##
## Overall Statistics
##
## Accuracy : 0.6623
## 95% CI : (0.6204, 0.7024)
## No Information Rate : 0.4615
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4343
##
## Mcnemar's Test P-Value : 0.002616
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7602 0.6095 0.49351
## Specificity 0.7213 0.7368 0.96711
## Pos Pred Value 0.7004 0.6009 0.71698
## Neg Pred Value 0.7782 0.7438 0.91875
## Precision 0.7004 0.6009 0.71698
## Recall 0.7602 0.6095 0.49351
## F1 0.7290 0.6052 0.58462
## Prevalence 0.4615 0.3940 0.14447
## Detection Rate 0.3508 0.2402 0.07129
## Detection Prevalence 0.5009 0.3996 0.09944
## Balanced Accuracy 0.7407 0.6732 0.73031
#model 3
red_wine_svm_radial_rw <- svm(quality~ ., train_cv_3, kernel = "radial", gamma = 0.5,
scale = TRUE, cost = 10)
test_pred_radial_rw <- predict(red_wine_svm_radial_rw, test_cv_3[,1:11])
confusion_matrix_radial_rw <- confusionMatrix(test_pred_radial_rw, test_cv_3[,12], mode = "everything")
confusion_matrix_radial_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 199 66 10
## 2 50 134 31
## 3 4 15 24
##
## Overall Statistics
##
## Accuracy : 0.6698
## 95% CI : (0.6281, 0.7096)
## No Information Rate : 0.4747
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.4332
##
## Mcnemar's Test P-Value : 0.01586
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.7866 0.6233 0.36923
## Specificity 0.7286 0.7453 0.95940
## Pos Pred Value 0.7236 0.6233 0.55814
## Neg Pred Value 0.7907 0.7453 0.91633
## Precision 0.7236 0.6233 0.55814
## Recall 0.7866 0.6233 0.36923
## F1 0.7538 0.6233 0.44444
## Prevalence 0.4747 0.4034 0.12195
## Detection Rate 0.3734 0.2514 0.04503
## Detection Prevalence 0.5159 0.4034 0.08068
## Balanced Accuracy 0.7576 0.6843 0.66432
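To summarise the three folds above, the usual practice is to average the fold-level accuracies into a single cross-validated estimate. A minimal sketch, using the accuracies reported in the three confusion matrices above:

```r
# Fold accuracies copied from the outputs above; their mean is a more
# stable performance estimate than any single train/test split.
fold_accuracies <- c(0.6454, 0.6623, 0.6698)
mean(fold_accuracies)
```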
4.5 ANN
For completeness, the reader can have a look at the implementation of an ANN. ANNs are computing systems inspired by the biological neural networks that constitute animal brains. This is a supervised learning method to analyse data for classification and regression.
In this project, a possible implementation of the ANN model is mentioned. However, the complexity behind the scenes makes it difficult to apply and interpret the models. Therefore, this section is more about showing the reader a possible implementation of a solution rather than a real suggestion.
To be more specific, the neural networks created in this section have two hidden layers, with three and four neurons respectively (hidden = c(3,4)). However, the complexity can be increased considerably. What is truly problematic is choosing the network architecture and understanding what happens inside the function neuralnet(). A further obstacle is finding the right threshold for the training to converge.
The reader can discover this aspect directly in the code below.
###RED WINE
set.seed(123)
neural_net_model_rw <- neuralnet(quality~ ., train_rw, hidden = c(3,4), threshold = 0.5)
prediction_neural_net_model_rw <- compute(neural_net_model_rw, test_in_rw)
prediction_neural_net_model_rw <- apply(prediction_neural_net_model_rw$net.result, 1, which.max)
prediction_neural_net_model_rw <- factor(levels(test_truth_rw)[prediction_neural_net_model_rw], levels = levels(test_truth_rw))
confusion_matrix_nn_rw <- confusionMatrix(prediction_neural_net_model_rw, test_truth_rw,mode = "everything")
confusion_matrix_nn_rw
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 77 33 5
## 2 109 126 49
## 3 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5088
## 95% CI : (0.4586, 0.5589)
## No Information Rate : 0.4662
## P-Value [Acc > NIR] : 0.04901
##
## Kappa : 0.156
##
## Mcnemar's Test P-Value : < 2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.4140 0.7925 0.0000
## Specificity 0.8216 0.3417 1.0000
## Pos Pred Value 0.6696 0.4437 NaN
## Neg Pred Value 0.6162 0.7130 0.8647
## Precision 0.6696 0.4437 NA
## Recall 0.4140 0.7925 0.0000
## F1 0.5116 0.5688 NA
## Prevalence 0.4662 0.3985 0.1353
## Detection Rate 0.1930 0.3158 0.0000
## Detection Prevalence 0.2882 0.7118 0.0000
## Balanced Accuracy 0.6178 0.5671 0.5000
###WHITE WINE
# set.seed(123)
neural_net_model_ww <- neuralnet(quality~ ., train_ww, hidden = c(3,4), threshold = 0.5)
prediction_neural_net_model_ww <- compute(neural_net_model_ww, test_in_ww)
prediction_neural_net_model_ww <- apply(prediction_neural_net_model_ww$net.result, 1, which.max)
prediction_neural_net_model_ww <- factor(levels(test_truth_ww)[prediction_neural_net_model_ww], levels = levels(test_truth_ww))
confusion_matrix_nn_ww <- confusionMatrix(prediction_neural_net_model_ww, test_truth_ww, mode = "everything")
confusion_matrix_nn_ww
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 11 1 0
## 2 399 548 265
## 3 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4567
## 95% CI : (0.4285, 0.4851)
## No Information Rate : 0.4485
## P-Value [Acc > NIR] : 0.2923
##
## Kappa : 0.0168
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.026829 0.9982 0.0000
## Specificity 0.998771 0.0163 1.0000
## Pos Pred Value 0.916667 0.4521 NaN
## Neg Pred Value 0.670792 0.9167 0.7835
## Precision 0.916667 0.4521 NA
## Recall 0.026829 0.9982 0.0000
## F1 0.052133 0.6224 NA
## Prevalence 0.334967 0.4485 0.2165
## Detection Rate 0.008987 0.4477 0.0000
## Detection Prevalence 0.009804 0.9902 0.0000
## Balanced Accuracy 0.512800 0.5072 0.5000
The results for the red wine dataset are weaker than those of the multinomial logistic regression: the accuracy is around 51%, the first two classes reach F1 scores of only around 51% and 57%, and the third class is never predicted.
For what concerns white wine, the situation is even worse. The model predicts almost every sample as the second class (F1 score around 62%), the first class reaches an F1 score of only about 5%, the third class is never predicted, and the overall accuracy is below 46%.
5 Conclusion
To summarise, the aim of this project was to implement some models that deliver the simplest and most usable answers to our anonymous customer. In particular, the document goes through a different set of Machine Learning tools, from simple Linear Regression to more advanced methods, such as SVM or ANN. The project structure is divided into two parts. The first is this technical document, addressed not only to the customer but also to all readers interested in how to develop an analysis using the methodology illustrated above. The second part of our project is a Shiny Dashboard that our customer can use to dynamically answer his questions. Moreover, he can also use this dashboard to broaden and, perhaps, question the solutions proposed, creating new possibilities and playing with model features.
The main questions of this project were:
1. “In the company, we would like to have control of the value of density during the process. Unfortunately, our machines can calculate only chemical concentrations during wine-making. Can you find an easy-to-understand model to estimate the density starting from physicochemical factors?”
2. “Can you create a model that can predict the quality of a wine, given its physicochemical properties? If it is possible, what is the precision of the model, and can we rely on it?”
Question 1
To solve the first question, three different methods were used: a linear regression, a linear model able to parametrise non-linearities, and a GAM. In each case, only two predictors were chosen, due to this paper's commitment to simplicity. More complex linear models were certainly possible; however, the more predictors, the less interpretability. Since the aim is to provide a solution that can be easily used, our team decided to consider only models that can be represented in three dimensions (a response surface over two predictors). To answer the question, the solution is split into an approach for red wine and another for white wine. Concerning red wine, the selected model has density as the response and pH and fixed acidity concentration as predictors. The adjusted R2 is around 50%, and each predictor is highly significant (p-value close to zero). It is a simple way to model density and obtain an estimate. Therefore, by employing this method, our customer has the opportunity to estimate density while producing wine.
The second part of the solution consists of a model for white wines. To satisfy our customer's requests, the designated model is quadratic model number four: density as the response, a quadratic polynomial on the first predictor, residual sugar, and a linear second predictor, alcohol. In this case, the results are impressive: all the p-values are around zero and the adjusted R2 is 92%. Therefore, this model was chosen since it estimates density very well. Nevertheless, there is a drawback: this model is slightly more complicated than a linear regression on the same predictors. The suggestion is to use the right tools for prediction, such as the Shiny App provided. If our customer finds it too complicated, there is strong evidence that even the plain linear regression model can accurately fit the data (low p-values and 90% adjusted R2).
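In practice, the selected model can be used directly to estimate the density of a new batch. A minimal sketch, assuming the fitted quadratic_model_two_ww_4 object from the modelling section and hypothetical measurement values:

```r
# Hypothetical new white wine sample: residual sugar in g/L, alcohol in % vol.
new_sample <- data.frame(residual.sugar = 6.5, alcohol = 10.5)
predict(quadratic_model_two_ww_4, newdata = new_sample)
```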
Question 2
The second question presents some difficulties that have to be mentioned before giving the solution.
The main problems are:
Quality assessments in our dataset are subjective. As a result, machine learning models based on supervised learning can produce biased results.
In the document, there are only three different quality classes, which may not be enough to represent the complexity of wine quality.
The dataset is strongly unbalanced, with the majority of the records in classes 1 and 2.
In this project, only a subset of machine learning methods was tested. It could be crucial to address the problem with different approaches.
For both the red and white wine datasets, the radial kernel SVM is the most appropriate model. Through this model, it is possible to reach F1 scores of up to about 70%. This result is remarkable, and it shows that the resulting model can be employed for predicting wine quality, under the supervision of an expert. In particular, the customer can obtain a dynamic answer to quality assessment in the Shiny App, where there is a dedicated section to test the model.
6 Shiny
In this last section, the aim is to give some information regarding the main product of this document. The Shiny App is an interactive dashboard that can be used to explore the solutions, examine them and, why not, find new possible results. In this project, the readers had the opportunity to go through our solution to the problem. However, it is important to note that a wine-maker has broader know-how. Indeed, the Shiny App can be a tool to simplify and connect personal knowledge and technology, even for a non-technical audience.